studio: extend offline DNS auto-detect to inference parent + training#5512
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements an offline fallback mechanism for Hugging Face Hub interactions by introducing DNS probing and local cache resolution for GGUF models. These changes prevent long network timeouts when huggingface.co is unreachable by automatically enabling offline modes and short-circuiting API calls. Feedback focuses on improving the implementation's thread safety by avoiding global socket timeout mutations, optimizing redundant path operations during cache scanning, and centralizing duplicated helper logic for DNS probing and environment checks.
| prev = socket.getdefaulttimeout() | ||
| socket.setdefaulttimeout(timeout) | ||
| try: | ||
| try: | ||
| socket.gethostbyname(host) | ||
| return False | ||
| except Exception: | ||
| return True | ||
| finally: | ||
| socket.setdefaulttimeout(prev) |
There was a problem hiding this comment.
Modifying socket.setdefaulttimeout is not thread-safe as it affects the entire process. In the FastAPI parent process, concurrent requests could interfere with each other's timeout settings. A safer, thread-safe alternative for a reachability check is to use socket.create_connection((host, 443), timeout=timeout), which respects the timeout without mutating global state and supports both IPv4 and IPv6 addresses.
References
- Use socket.getaddrinfo() or high-level wrappers like socket.create_connection() to create sockets that support both IPv4 and IPv6 addresses, instead of hardcoding an address family.
| matches = sorted( | ||
| p.relative_to(snap).as_posix() | ||
| for p in snap.rglob("*.gguf") | ||
| if "mmproj" not in p.name.lower() | ||
| and boundary.search(p.relative_to(snap).as_posix().lower()) | ||
| ) |
There was a problem hiding this comment.
p.relative_to(snap).as_posix() is a relatively expensive path operation and is currently called twice for every file found by rglob. To improve efficiency, compute this value once and reuse the result.
matches = sorted(
rel
for p in snap.rglob("*.gguf")
if "mmproj" not in p.name.lower()
and (rel := p.relative_to(snap).as_posix())
and boundary.search(rel.lower())
)References
- When a condition or calculated value is used across multiple conditional branches, compute it once and reuse the result to ensure consistency and improve maintainability.
- To improve efficiency, avoid redundant data iterations and transformations.
| if "HF_HUB_OFFLINE" not in os.environ: | ||
| import socket as _socket | ||
|
|
||
| prev_timeout = _socket.getdefaulttimeout() | ||
| _socket.setdefaulttimeout(2.0) | ||
| try: | ||
| _socket.gethostbyname("huggingface.co") | ||
| except Exception: | ||
| os.environ["HF_HUB_OFFLINE"] = "1" | ||
| os.environ.setdefault("TRANSFORMERS_OFFLINE", "1") | ||
| logger.warning( | ||
| "huggingface.co unreachable; HF_HUB_OFFLINE=1 set for this worker." | ||
| ) | ||
| finally: | ||
| _socket.setdefaulttimeout(prev_timeout) |
There was a problem hiding this comment.
The DNS probe logic here is duplicated from _probe_dns_dead in core/inference/llama_cpp.py. You should centralize this logical check into a single helper function and reuse it across the codebase to ensure consistency and simplify maintenance.
References
- Centralize recurring or complex logical checks into a single helper function and reuse it across the codebase to ensure consistency and simplify maintenance.
| def _env_offline() -> bool: | ||
| """True if HF_HUB_OFFLINE or TRANSFORMERS_OFFLINE is set to a truthy value.""" | ||
| return os.environ.get("HF_HUB_OFFLINE", "").lower() in ( | ||
| "1", | ||
| "true", | ||
| "yes", | ||
| ) or os.environ.get("TRANSFORMERS_OFFLINE", "").lower() in ("1", "true", "yes") |
There was a problem hiding this comment.
The _env_offline helper is duplicated in both utils/models/model_config.py and utils/transformers_version.py. Since model_config.py already imports from transformers_version.py, you should centralize this logic in transformers_version.py and import it here to improve maintainability and avoid logic drift.
References
- Centralize recurring or complex logical checks into a single helper function and reuse it across the codebase to ensure consistency and simplify maintenance.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 26053dd50b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # so offline loads don't burn ~25s waiting for the HF API to time out. | ||
| # If the repo really is a LoRA, the worker still resolves it from | ||
| # cache later via the same env var. | ||
| if not is_lora and not is_local and not _env_offline(): |
There was a problem hiding this comment.
Detect cached LoRA adapters while offline
When HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE is set for a cached remote LoRA repo, this skips the only path that flips is_lora to true; from_identifier() then falls through as if the adapter repo were a full base model, so the later LoRA block never downloads/reads cached adapter_config.json and the inference worker receives is_lora=False. This regresses offline loading of already-cached HF LoRA adapters; the offline path should still check the local HF cache for adapter_config.json (or otherwise honor cached adapter metadata) instead of bypassing detection entirely.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dcb30f1c94
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| socket.setdefaulttimeout(timeout) | ||
| try: | ||
| try: | ||
| socket.gethostbyname(host) |
There was a problem hiding this comment.
Bound DNS probes with a real timeout
When a user's resolver is blackholed or very slow, this new probe can block the FastAPI load path well beyond the intended 2 seconds because socket.setdefaulttimeout() does not bound socket.gethostbyname() name-resolution time. In that environment, every model load that enters _hf_offline_if_dns_dead() can hang before it reaches the cached/offline fallback, so use a resolver strategy with an actual deadline instead of relying on the socket default timeout.
Useful? React with 👍 / 👎.
| if "HF_HUB_OFFLINE" in os.environ: | ||
| yield False |
There was a problem hiding this comment.
Keep offline scopes isolated per request
When two load requests overlap while DNS is down, the second one can enter here while the first request's context has temporarily set HF_HUB_OFFLINE; it then yields without owning/restoring the setting, and the first request's finally later pops the variable while the second request is still running. Because environment variables are process-wide, that second request can unexpectedly fall back to normal HF network calls/timeouts mid-scope; guard this with a lock/refcount or avoid using process-wide env mutation for per-request state.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 84e3782ace
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return os.environ.get("HF_HUB_OFFLINE", "").lower() in ( | ||
| "1", | ||
| "true", | ||
| "yes", | ||
| ) or os.environ.get("TRANSFORMERS_OFFLINE", "").lower() in ("1", "true", "yes") |
There was a problem hiding this comment.
Honor
ON in offline environment flags
When users set HF_HUB_OFFLINE=ON or TRANSFORMERS_OFFLINE=ON (standard truthy values for Hugging Face/Transformers), this helper returns False, so the new offline short-circuits below are skipped and _check_tokenizer_config_needs_v5 / _check_config_needs_550 still perform raw urllib fetches that can wait up to 10 seconds each before failing. This undermines the offline path added here for a supported env spelling; include on in the truthy set, and keep the duplicate offline helper consistent.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ec15082487
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| prev = socket.getdefaulttimeout() | ||
| socket.setdefaulttimeout(timeout) | ||
| try: |
There was a problem hiding this comment.
Avoid mutating process-wide socket defaults in DNS probe
_probe_dns_dead() changes the global socket default timeout via socket.setdefaulttimeout(timeout) before resolving huggingface.co. Because this helper is used on the FastAPI parent load path, concurrent requests in the same process can create sockets during this window and unexpectedly inherit the 2s default, causing unrelated outbound calls to fail or time out early. Please switch to a probe mechanism that enforces its own deadline without touching process-wide socket defaults.
Useful? React with 👍 / 👎.
ec15082 to
cf0fc65
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 003761caf2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| t = threading.Thread(target = _probe, daemon = True) | ||
| t.start() | ||
| t.join(timeout) | ||
| # Thread still running -> resolver wedged -> treat as dead. | ||
| return True if result[0] is None else result[0] |
There was a problem hiding this comment.
Reuse a bounded DNS probe thread
When DNS resolution hangs (the _probe thread never returns), join(timeout) exits but the daemon thread keeps running, and each call to this helper spawns another stuck thread. Because _hf_offline_if_dns_dead() is now used on model-load paths in the long-lived inference parent process, repeated loads during a resolver outage can accumulate blocked threads and eventually hit thread/memory limits, degrading or breaking further requests.
Useful? React with 👍 / 👎.
| if "HF_HUB_OFFLINE" in os.environ: | ||
| yield False | ||
| return |
There was a problem hiding this comment.
Treat HF_HUB_OFFLINE as enabled only when truthy
This guard checks only for variable presence, so HF_HUB_OFFLINE=0 (a common explicit default) disables the DNS auto-detect path entirely. In that configuration, dead-DNS loads bypass the new offline short-circuit and fall back to slow network timeouts again, which defeats the regression fix this change is introducing.
Useful? React with 👍 / 👎.
Two follow-ups from the review pass on unslothai#5512: * ModelConfig.from_identifier no longer skips the remote LoRA-detect hf_model_info call when _env_offline() is true. huggingface_hub short-circuits the call via OfflineModeIsEnabled in ~0ms when HF_HUB_OFFLINE is set, so the original 25s concern was moot once routes/inference.py wrapped the call in _hf_offline_if_dns_dead. Skipping the API meant users with a cached LoRA adapter (adapter_config.json on disk) got is_lora=False and the load failed. After the API call (which raises fast offline) a new cache-fallback walks the HF cache snapshot for adapter_config.json via the existing _iter_hf_cache_snapshots helper. * test_hf_model_info_not_called_when_offline replaced. The old test raised AssertionError inside production code that catches Exception, so it passed even if the call happened. New tests use MagicMock and assert call_count >= 1, plus a fixture that stages a fake HF cache with adapter_config.json to verify the offline cache detection. Test count goes from 7 to 8 in test_offline_inference_parent.py. Combined with test_offline_gguf_cache_fallback.py: 34 pass in 9.75s.
Two follow-ups from the review pass on #5512: * ModelConfig.from_identifier no longer skips the remote LoRA-detect hf_model_info call when _env_offline() is true. huggingface_hub short-circuits the call via OfflineModeIsEnabled in ~0ms when HF_HUB_OFFLINE is set, so the original 25s concern was moot once routes/inference.py wrapped the call in _hf_offline_if_dns_dead. Skipping the API meant users with a cached LoRA adapter (adapter_config.json on disk) got is_lora=False and the load failed. After the API call (which raises fast offline) a new cache-fallback walks the HF cache snapshot for adapter_config.json via the existing _iter_hf_cache_snapshots helper. * test_hf_model_info_not_called_when_offline replaced. The old test raised AssertionError inside production code that catches Exception, so it passed even if the call happened. New tests use MagicMock and assert call_count >= 1, plus a fixture that stages a fake HF cache with adapter_config.json to verify the offline cache detection. Test count goes from 7 to 8 in test_offline_inference_parent.py. Combined with test_offline_gguf_cache_fallback.py: 34 pass in 9.75s.
003761c to
47b9e8c
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d7f94471ec
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| # from_pretrained / snapshot_download / load_dataset calls resolve | ||
| # from cache instead of burning ~25s per call on retries. Scoped to | ||
| # this subprocess only (orchestrator spawns a fresh worker per run). | ||
| if "HF_HUB_OFFLINE" not in os.environ: |
There was a problem hiding this comment.
Run DNS probe when HF_HUB_OFFLINE is falsy
This guard treats any presence of HF_HUB_OFFLINE as an opt-out, so common explicit defaults like HF_HUB_OFFLINE=0 skip the auto-detect path entirely. In that configuration, a dead resolver no longer flips the worker to offline mode and training falls back to slow Hugging Face/network timeouts, which defeats the latency regression fix this block is meant to provide for cached models.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6030709a47
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if _env_offline(): | ||
| _tokenizer_class_cache[model_name] = False | ||
| return False |
There was a problem hiding this comment.
Read cached configs before defaulting offline
When HF_HUB_OFFLINE is set for a cached remote model whose required transformers tier is only discoverable from tokenizer_config.json/config.json (for example a forked Gemma4-style repo without the hard-coded name substring), this new early return skips the cached HF snapshot and forces _check_tokenizer_config_needs_v5 to False; the duplicated guard in _check_config_needs_550 does the same. activate_transformers_for_subprocess() then selects the default 4.57.x tier even though the needed metadata is already on disk, so offline training/inference of those cached models can fail with the wrong transformers version instead of avoiding only the network fetch.
Useful? React with 👍 / 👎.
#5505 fixed the GGUF/llama-server load path. Studio still has two adjacent code paths that burn ~30-60s of soft-failed timeouts before the worker subprocess starts when DNS to huggingface.co is dead and the model is already in the local HF cache. Inference parent process (routes/inference.py:load_model): * ModelConfig.from_identifier now runs inside _hf_offline_if_dns_dead so the LoRA-detect hf_model_info call and the urllib config probes in utils/transformers_version.py short-circuit when DNS is dead. * utils/models/model_config.py: extracted the inline HF_HUB_OFFLINE/ TRANSFORMERS_OFFLINE check used by list_gguf_variants and detect_gguf_model_remote into a shared _env_offline() helper, then reused it to gate the LoRA-detect hf_model_info call. * utils/transformers_version.py: _check_tokenizer_config_needs_v5 and _check_config_needs_550 now early-return False when offline instead of issuing a 10s urllib.urlopen against huggingface.co/raw/main. Training worker (core/training/worker.py:run_training_process): * Add the same 2s DNS probe used by core/inference/worker.py at the top of the training subprocess. On failure, set HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE before the rest of the subprocess imports torch/transformers/unsloth, so every from_pretrained, snapshot_download, and load_dataset call below resolves from cache. Scope is per-subprocess; the orchestrator always spawns a fresh worker per training run. Training trainer (core/training/trainer.py:load_model): * Skip the proactive hf_model_info gated-repo probe when _env_offline() is true. The API is unreachable anyway, and a gated model that is already cached is exactly the scenario the user is trying to train against. from_pretrained surfaces the real error if access is actually denied. Tests (tests/test_offline_inference_parent.py, 7 new cases): * _env_offline truthy/falsy parsing across HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE. * transformers_version urllib short-circuit when offline. * LoRA detect hf_model_info skip when offline. Existing tests/test_offline_gguf_cache_fallback.py still passes (26 cases) because the inline env check was extracted, not changed.
The studio test stub convention only included the 6 httpx exception names that existed callers needed. Newer huggingface_hub (1.15+) imports HTTPError, Response, Request, HTTPStatusError, AsyncClient, and more at module import time. When httpx is truly absent the stub chase becomes a treadmill. Use the real package when installed (the CI install list already includes httpx, so this is the production environment). Fall back to the stub only when httpx is genuinely missing. No code under test changes.
Two follow-ups from the review pass on #5512: * ModelConfig.from_identifier no longer skips the remote LoRA-detect hf_model_info call when _env_offline() is true. huggingface_hub short-circuits the call via OfflineModeIsEnabled in ~0ms when HF_HUB_OFFLINE is set, so the original 25s concern was moot once routes/inference.py wrapped the call in _hf_offline_if_dns_dead. Skipping the API meant users with a cached LoRA adapter (adapter_config.json on disk) got is_lora=False and the load failed. After the API call (which raises fast offline) a new cache-fallback walks the HF cache snapshot for adapter_config.json via the existing _iter_hf_cache_snapshots helper. * test_hf_model_info_not_called_when_offline replaced. The old test raised AssertionError inside production code that catches Exception, so it passed even if the call happened. New tests use MagicMock and assert call_count >= 1, plus a fixture that stages a fake HF cache with adapter_config.json to verify the offline cache detection. Test count goes from 7 to 8 in test_offline_inference_parent.py. Combined with test_offline_gguf_cache_fallback.py: 34 pass in 9.75s.
Same fix as #5505's _probe_dns_dead refactor: run gethostbyname on a daemon thread with join timeout so concurrent sockets in the parent interpreter never inherit a process-wide socket.setdefaulttimeout mutation. Adds a static-pin regression test that the inference parent file does not regress on this.
for more information, see https://pre-commit.ci
Shorten the longer explanatory comments added by this PR while keeping the WHY of each non-obvious branch: - trainer.py: collapse the 5-line proactive gated-check comment. - training/worker.py: trim the offline auto-detect preamble and the "logger isn't configured" note. - routes/inference.py: shorten the DNS-probe wrap rationale. - transformers_version.py: collapse the two urllib short-circuit notes. - model_config.py: shorten the LoRA detect + cache-fallback notes. - tests/test_offline_inference_parent.py: tighter module docstring, trim class docstrings, drop multi-line explainer comments inside the tests; behaviour and coverage unchanged (9/9 tests still pass).
6030709 to
72ff31a
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 72ff31a15d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| # Offline: skip the 10s urllib fetch (fail-open to lower tier). | ||
| if _env_offline(): | ||
| _config_needs_550_cache[model_name] = False |
There was a problem hiding this comment.
Don't cache transient offline tier defaults
When _hf_offline_if_dns_dead() temporarily enables offline mode for a parent-process load and this check runs for a remote model without a local config.json, this stores False in the process-wide _config_needs_550_cache. Because the cache is consulted before any later local/network read, the same model remains classified as not needing transformers 5.5.0 even after DNS recovers, so the inference orchestrator/vision detection can keep spawning workers with the wrong transformers tier. The offline shortcut should avoid populating the normal cache, or key the cached value on the offline state.
Useful? React with 👍 / 👎.
Summary
Follow-up to #5505. That PR fixed the GGUF/llama-server load path. Two adjacent Studio code paths still burn 30-60s of soft-failed network timeouts before the worker subprocess starts when DNS to huggingface.co is dead and the model is already in the local HF cache. This PR extends the same DNS auto-detect helper to both.
Inference parent process (FastAPI side, before worker spawn)
routes/inference.py:load_modelnow runsModelConfig.from_identifierinside_hf_offline_if_dns_deadso the soft-failed network calls reached transitively from there short-circuit on dead DNS:utils/models/model_config.pyLoRA-detecthf_model_info(identifier, token=...)call (was ~25s timeout)utils/models/model_config.pyhf_hub_download(identifier, 'adapter_config.json', ...)for remote LoRAs (was ~25s timeout, now bails fast viaLocalEntryNotFoundError)utils/transformers_version.py_check_tokenizer_config_needs_v5rawurllib.urlopen(...)/tokenizer_config.json(was ~10s timeout)utils/transformers_version.py_check_config_needs_550rawurllib.urlopen(...)/config.json(was ~10s timeout)The inline env-var check used by
list_gguf_variantsanddetect_gguf_model_remote(added in #5505) is extracted into a shared_env_offline()helper to avoid duplicating the truthy-value parsing across new call sites.Training subprocess (core/training/worker.py)
run_training_processnow mirrors the DNS auto-detect already incore/inference/worker.py. On dead DNS, it setsHF_HUB_OFFLINE,TRANSFORMERS_OFFLINE, andHF_DATASETS_OFFLINEbefore importing torch/transformers/unsloth, so everyfrom_pretrained,snapshot_download, andload_datasetcall further down resolves from cache. Scope is per-subprocess (the orchestrator spawns a fresh worker per training run).core/training/trainer.py:load_modelskips the proactivehf_model_infogated-repo probe when_env_offline()is true. The API is unreachable, and a gated model that is already cached is exactly the scenario the user is trying to train against.from_pretrainedsurfaces the real error if access is actually denied.Behaviour
HF_HUB_OFFLINE=1orTRANSFORMERS_OFFLINE=1: preserved end-to-end (the contextmanager already respects this from studio: load cached GGUF models when fully offline #5505).Test plan
studio/backend/tests/test_offline_inference_parent.py: 7 new cases covering_env_offline()parsing,transformers_versionurllib short-circuit, LoRA-detect API skip.studio/backend/tests/test_offline_gguf_cache_fallback.py: 26 existing cases still pass after the env-check extraction.Stacks on #5505. Recommend merging that first.